Title

Site Reliability Engineer (SRE)

Description

We are looking for a Site Reliability Engineer to join our growing technology team. As a Site Reliability Engineer (SRE), you will be responsible for ensuring the reliability, availability, and performance of our systems and services. You will work closely with software engineers, system administrators, and other stakeholders to build and maintain scalable infrastructure, automate operational tasks, and respond to incidents effectively.

The ideal candidate will have a strong background in systems engineering, cloud infrastructure, and software development. You will be expected to design and implement monitoring solutions, develop tools to improve system reliability, and participate in on-call rotations to address production issues. Your work will directly impact the user experience by minimizing downtime and ensuring high availability of our services.

As an SRE, you will also be responsible for conducting post-incident reviews, identifying root causes, and implementing long-term solutions to prevent recurrence. You will champion best practices in system design, deployment, and maintenance, and help foster a culture of reliability and continuous improvement across the organization.

This role requires excellent problem-solving skills, a proactive mindset, and the ability to work in a fast-paced, collaborative environment. If you are passionate about building reliable systems and enjoy working at the intersection of software and operations, we encourage you to apply.

Responsibilities

Design, build, and maintain scalable and reliable infrastructure
Develop and implement monitoring and alerting systems
Automate operational tasks and improve system efficiency
Participate in on-call rotations and respond to incidents
Conduct root cause analysis and post-incident reviews
Collaborate with development teams to improve system architecture
Ensure high availability and performance of services
Implement security best practices and compliance standards
Manage cloud infrastructure and deployment pipelines
Continuously improve system reliability and operational processes

Requirements

Bachelor’s degree in Computer Science or a related field
3+ years of experience in Site Reliability Engineering or DevOps
Strong knowledge of Linux/Unix systems
Experience with cloud platforms such as AWS, GCP, or Azure
Proficiency in scripting languages like Python, Bash, or Go
Familiarity with CI/CD tools and practices
Experience with monitoring tools like Prometheus, Grafana, or Datadog
Understanding of networking, security, and system architecture
Excellent troubleshooting and problem-solving skills
Strong communication and collaboration abilities

Potential interview questions

What experience do you have with cloud infrastructure?
Can you describe a time you resolved a major system outage?
What monitoring tools have you used in previous roles?
How do you approach automating repetitive tasks?
What is your experience with CI/CD pipelines?
How do you ensure system security and compliance?
Describe your experience with on-call rotations.
What scripting languages are you most comfortable with?
How do you handle post-incident reviews?
What strategies do you use to improve system reliability?
